Experts have concluded wine is deliciously tasty, but what makes wine so good? With over 1500 observations of chemical, physical, and sensory data gathered with the help of wine experts, the goal of this analysis is to find which properties of wine people like the most.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The dataset consists of 1599 red wines, with 12 variables. Eleven variables are chemical/physical and one variable is sensory (quality).
Explanation of variables as per [Cortez et al., 2009]:
fixed.acidity - (tartaric acid: g / dm^3) - most acids involved with wine are fixed or nonvolatile
volatile.acidity - (acetic acid: g / dm^3) - the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric.acid - (g / dm^3) - found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual.sugar - (g / dm^3) - the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
chlorides - (sodium chloride: g / dm^3) - the amount of salt in the wine
free.sulfur.dioxide - (mg / dm^3) - the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total.sulfur.dioxide - (mg / dm^3) - amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density - (g / cm^3) - the density of wine is close to that of water depending on the percent alcohol and sugar content
pH - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates - (potassium sulphate: g / dm^3) - a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol - (% by volume) - the percent alcohol content of the wine
quality - (scored 0-10)
*All Univariate Plots created have a binwidth = 0.1 unless otherwise stated
## fixed.acidity
## Min. : 4.60
## 1st Qu.: 7.10
## Median : 7.90
## Mean : 8.32
## 3rd Qu.: 9.20
## Max. :15.90
The plot of fixed.acidity is right-tailed and has range = [4.6, 15.9]. A majority of the data exists on the interval [7.1, 9.2] with values above 14 likely being outliers.
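One common way to make "likely outliers" concrete is Tukey's 1.5 × IQR rule. The sketch below applies it to the fixed.acidity quartiles printed above. The rule and the helper name `tukey_fences` are assumptions for illustration; the cutoffs quoted in this report were read off the plots.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of fixed.acidity taken from the summary above.
lower, upper = tukey_fences(7.10, 9.20)
print(f"fences: [{lower:.2f}, {upper:.2f}]")
```

Under this rule the upper fence is 9.2 + 1.5 × 2.1 = 12.35, which is stricter than the visual cutoff of 14 used above.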
## volatile.acidity
## Min. :0.1200
## 1st Qu.:0.3900
## Median :0.5200
## Mean :0.5278
## 3rd Qu.:0.6400
## Max. :1.5800
The plot of volatile.acidity is slightly right-tailed and has two peaks. It has a range of [0.12, 1.58] with most values between [0.39, 0.64]. Values above ~1.1 are likely outliers.
*binwidth = 0.02
## citric.acid
## Min. :0.000
## 1st Qu.:0.090
## Median :0.260
## Mean :0.271
## 3rd Qu.:0.420
## Max. :1.000
The plot of citric.acid shows that many wines contain little to no citric acid; however, the graph has no visible peak or tail. It has a range = [0, 1] with most values between [0.09, 0.42]. Values above 0.6 are likely outliers.
*binwidth = .01
## residual.sugar
## Min. : 0.900
## 1st Qu.: 1.900
## Median : 2.200
## Mean : 2.539
## 3rd Qu.: 2.600
## Max. :15.500
The plot of residual.sugar is right-tailed with one peak. The range = [0.9, 15.5] with most values appearing to be within [1.9, 2.6]. Values above 4 are likely outliers.
Data was transformed to further inspect the peak and tail. There appears to be no large change from the initial inspection.
*Transformed data has a binwidth = 0.02
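The report does not say which transformation was applied. A base-10 logarithm is a common choice for right-tailed data, so the sketch below assumes that; the values are the first ten residual.sugar observations from the structure listing at the top.

```python
import math

# First ten residual.sugar values from the str() listing above.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]

# Assumption: a log10 transform (the report does not name the transform used).
log_sugar = [math.log10(v) for v in residual_sugar]

spread_raw = max(residual_sugar) - min(residual_sugar)
spread_log = max(log_sugar) - min(log_sugar)
print(f"raw range: {spread_raw:.2f}, log10 range: {spread_log:.2f}")
```

A log scale pulls the long right tail toward the bulk of the data, which is why it is often tried on variables such as residual.sugar, chlorides, and the sulfur dioxide measures.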
## chlorides
## Min. :0.01200
## 1st Qu.:0.07000
## Median :0.07900
## Mean :0.08747
## 3rd Qu.:0.09000
## Max. :0.61100
The plot for chlorides has one peak and is right-tailed. Range = [0.012, 0.611] with most data values between [0.07, 0.09]. Values above 0.12 are likely outliers.
Transformed data still shows one peak; however, it no longer appears to be right-tailed. The distribution appears roughly even.
*binwidth = .01 for both graphs
## free.sulfur.dioxide
## Min. : 1.00
## 1st Qu.: 7.00
## Median :14.00
## Mean :15.87
## 3rd Qu.:21.00
## Max. :72.00
The plot for free.sulfur.dioxide is right-tailed with one peak. The range = [1, 72] with most values between [7, 21].
The transformed plot has a large dip at 8, which is a bit odd. Otherwise it has a wide peak with short tails.
*binwidth = 1 (not transformed)
## total.sulfur.dioxide
## Min. : 6.00
## 1st Qu.: 22.00
## Median : 38.00
## Mean : 46.47
## 3rd Qu.: 62.00
## Max. :289.00
The plot for total.sulfur.dioxide has one peak and is heavily right-tailed. The range = [6, 289] with most values between [22, 62].
Transformed data has an apparent normal distribution.
*binwidth = 1 (not transformed)
## density
## Min. :0.9901
## 1st Qu.:0.9956
## Median :0.9968
## Mean :0.9967
## 3rd Qu.:0.9978
## Max. :1.0037
The plot for density is normally distributed. The range = [.9901, 1.0037] with most values between [.9956, .9978]. For reference, alcohol has a density of ~.789 g / cm^3.
*binwidth = .005
## pH
## Min. :2.740
## 1st Qu.:3.210
## Median :3.310
## Mean :3.311
## 3rd Qu.:3.400
## Max. :4.010
The plot for pH is normally distributed. The range = [2.74, 4.01] with most values between [3.21, 3.40].
*binwidth = .005
## sulphates
## Min. :0.3300
## 1st Qu.:0.5500
## Median :0.6200
## Mean :0.6581
## 3rd Qu.:0.7300
## Max. :2.0000
The plot for sulphates has one peak and is right-tailed. The range = [.33, 2] with most values between [.55, .73].
Transformed data has an apparent normal distribution and is slightly right-tailed.
*binwidth = .05 (not transformed) and binwidth = .05 (transformed)
## alcohol
## Min. : 8.40
## 1st Qu.: 9.50
## Median :10.20
## Mean :10.42
## 3rd Qu.:11.10
## Max. :14.90
The plot for alcohol has generally one peak and is right-tailed. The range = [8.4, 14.9] with most values between [9.5, 11.1].
Transformed data has the same characteristics as the untransformed graph.
*binwidth = .01 (transformed)
The plot for quality is normally distributed. The range = [3, 8] with most values between [5, 6]. This is a bit peculiar because wines were rated on a scale of 0-10. No wines were rated below a 3 or above an 8.
*binwidth = 1
Most plots have a single peak and are right-tailed. In general the transformed plots didn’t reveal anything interesting.
The most interesting note was that although wines were scored on a scale of 0-10, the lowest given score is a 3 and the highest an 8. Most wines were given a score of 5 or 6.
From descriptions via [Cortez et al., 2009] and my own personal bias I am paying particular attention to certain variables:
Although the goal is to compare everything to quality, let’s first explore the correlation plots.
The bottom row of quality is where I’m looking. It appears to have a moderately negative correlation with volatile.acidity (red) and a moderately positive correlation with alcohol (blue). I am surprised to see that both citric.acid and total.sulfur.dioxide are not very significant.
The correlation coefficients are:
volatile.acidity to quality: -0.39
alcohol to quality : 0.48
Both coefficients indicate a moderate relationship with quality.
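For readers who want to see what such a coefficient measures, here is a minimal sketch of the Pearson correlation, assuming that is the coefficient reported above (the report does not name the method). It runs on only the first ten rows from the structure listing, so its result will not match the 0.48 computed over all 1599 wines; `pearson_r` is a helper written for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First ten rows from the str() listing above (not the full 1599 wines).
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

r = pearson_r(alcohol, quality)
print(f"alcohol vs quality on ten rows: r = {r:.2f}")
```

The value always lies in [-1, 1]; the sign gives the direction of the relationship and the magnitude its strength.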
Neither volatile.acidity nor alcohol is strongly correlated with quality; however, splitting the data by quality yields some interesting results. Looking at the boxplots, higher quality wine typically has more alcohol and lower volatile acidity.
How do the other plots compare to quality?
In the above image the bivariate graphs vs quality are on the bottom row and the correlation coefficients are in the rightmost column.
As viewed in the right column, none of the other variables have any sort of significant correlation with quality (all have a magnitude <.3).
Where else is there to inspect? Let’s take a closer look at what’s strongly correlated with both volatile.acidity and alcohol (density, citric.acid, and pH) as well as sulphates (quality correlation = .25).
Higher quality wines generally have higher than ~.3 g/dm^3 of citric acid. Interesting.
Wines with extreme densities (outliers) appear to be mid quality wines. The highest quality wines typically have a lower density (likely due to the correlation with alcohol, which has a density of ~.789 g/cm^3).
There doesn’t appear to be anything too interesting here. High quality wines have somewhat lower pH, but no strong observations can be made.
Values above 1.25 g/dm^3 were removed to better view overall trends. The median values of sulphates increase as quality increases.
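The trim-then-summarise step described here can be sketched as follows. The data are only the first ten rows from the structure listing, so this shows the mechanics and not the trend itself; the variable names are chosen for this illustration.

```python
from collections import defaultdict
from statistics import median

# First ten rows from the str() listing above (not the full data set).
sulphates = [0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

CUTOFF = 1.25  # g/dm^3, the trimming threshold stated in the text

by_quality = defaultdict(list)
for s, q in zip(sulphates, quality):
    if s <= CUTOFF:  # drop values above the cutoff before summarising
        by_quality[q].append(s)

# Median sulphates per quality score, ordered by score.
medians = {q: median(vals) for q, vals in sorted(by_quality.items())}
print(medians)
```

With so few rows the medians need not rise with quality; the increasing pattern described above comes from the full column.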
The only feature being observed is quality; however, there are interesting relationships between other features that I would like to explore.
As alcohol increases density decreases. Alcohol appears to have a lower density than the average wine.
There is a definite negative correlation between citric.acid and volatile.acidity. I wonder how this corresponds to quality?
The variables that affect quality the most are alcohol and volatile.acidity. Their correlation coefficients are:
volatile.acidity to quality: -0.39
alcohol to quality : 0.48
Neither variable is strongly correlated with quality; at best they have a moderate influence, with alcohol having the higher impact. The negative correlation for volatile.acidity is likely because high amounts lead to an unpleasant vinegar taste. Alcohol does have a taste, but I’m curious about the positive correlation. Alcohol is the only variable that is also given on the label; I wonder if the tasting experts were able to read the alcohol content on the label during tasting and how that affected their rating of the wine.
Minor significance can be found with citric.acid and sulphates (each with correlation coefficients ~.24). Citric acid can add ‘freshness’ to wine, which is likely the cause of its positive correlation, whereas sulphates act as an antimicrobial and antioxidant.
From the variable description I was surprised not to see a strong or moderate correlation with total.sulfur.dioxide. It is described as having a strong odor and taste at high levels, which I thought would affect the rating.
Other non-feature variables that showed strong correlation with each other were also compared. Alcohol and density have a negative correlation, likely because alcohol has a lower density than wine. I do not know how citric and acetic acid affect one another chemically; however, the results show a negative correlation between citric.acid and volatile.acidity (acetic acid).
High quality wines (quality = 7 or 8) generally have low volatile.acidity (below .5) and high amounts of alcohol (above 10%). It appears many of the quality = 5 wines have very low alcohol and a wide range of volatile acidity.
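The two thresholds in this observation can be written as a simple check. This is a sketch for illustration only: the function name is hypothetical, and the rows are the first ten from the structure listing, not a representative sample.

```python
def fits_high_quality_profile(alcohol, volatile_acidity):
    """True when alcohol is above 10% and volatile acidity is below 0.5 g/dm^3."""
    return alcohol > 10.0 and volatile_acidity < 0.5

# First ten rows from the str() listing above: (alcohol, volatile.acidity, quality).
rows = [
    (9.4, 0.70, 5), (9.8, 0.88, 5), (9.8, 0.76, 5), (9.8, 0.28, 6), (9.4, 0.70, 5),
    (9.4, 0.66, 5), (9.4, 0.60, 5), (10.0, 0.65, 7), (9.5, 0.58, 7), (10.5, 0.50, 5),
]

matches = [row for row in rows if fits_high_quality_profile(row[0], row[1])]
print(f"{len(matches)} of {len(rows)} rows fit the profile")
```

None of these ten rows pass both thresholds, including the two quality = 7 wines, which is a reminder that the pattern is a tendency and not a rule.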
High quality wines appear to range all across the scale with most quality >= 6 wines containing higher amounts of alcohol. Most of the lower quality wines tend to have density values = [.995, 1.0] while also containing low amounts of alcohol.
There is a very apparent cluster of high quality wines with moderate levels of citric.acid (.25 and above) in a specific interval of sulphates ([.5, 1.0]). Unlike the other trend lines, the one for quality=8 wines has a negative slope; however, I believe this is due to a clustering of quality wines.
Wines of high quality tend to follow a few rules:
There is no guarantee that wines meeting these specifications will be of high quality; however, it appears to be a good indicator. The range of sulphates for high quality wines does not appear to be a strong enough factor for quality.
Quality is rated on a scale of 0-10 by wine experts, however the only scores assigned were 3-8. Most wines are mid quality (~1300) with very few low quality (~70) and high quality (~220) wines.
Because of such low counts, any clustering of high quality (7 or 8) or low quality (3 or 4) wines is important for analysis.
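The low / mid / high grouping used here can be sketched as a small mapping. Treating every score of 4 or below as low and 7 or above as high is an assumption that extends the observed 3-8 range to the full 0-10 scale; `quality_tier` is a name chosen for this illustration.

```python
from collections import Counter

def quality_tier(score):
    """Map a quality score to the low / mid / high grouping used in the text."""
    if score <= 4:
        return "low"
    if score >= 7:
        return "high"
    return "mid"

# First ten quality scores from the str() listing above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

tier_counts = Counter(quality_tier(q) for q in quality)
print(tier_counts)
```

Applied to the full quality column, the same count would give the approximate tier sizes quoted above.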
The two most correlated variables (alcohol and volatile.acidity) were compared to quality.
Quality has a positive correlation with alcohol. Most high quality wines exceed ~11% alcohol by volume and most low quality wines (3, 4, and 5) do not exceed 12% (which is the average for quality=8 wines).
Quality has a negative correlation with volatile acidity. Usual values of volatile acidity for low quality wines, [0.6, 1.0] g/dm^3, are considered to be outliers for high quality wines. High quality wines generally have volatile acidity values < 0.5 g/dm^3.
Important points from Plot 2 are that most high quality wines:
Many of the higher tier wines (quality=7 and quality=8) are clustered around citric acid values > 0.25 g/dm^3 and have a range of sulphates of [0.5, 1.0].
There are a few high quality wines that have lower citric acid values; however, they share those citric acid values with a large number of lower quality wines.
I believe the negative trend line for quality=8 wines is due to clustering and should be ignored.
Of the 1599 red wines observed only a few variables have a moderate to slightly moderate impact on the quality of a wine. The most impactful variables with their correlation values are:
Variables with positive correlation values generally affect wine positively, whereas the opposite is true with negative values.
Alcohol is a much stronger indicator for quality than any other variable. Being a common variable listed on the label of a wine I wonder if the knowledge of alcohol percentage created bias during the process of wine quality ratings.
Generally, wines that are highly rated share 4 main attributes (ranked highest first):
There are exceptions for high quality wines that do not adhere to these guidelines but most good wines follow at least a few.
The biggest difficulty with this data set is not having a range of wines representing the full scale they were scored on (0-10). It would be interesting to have data related to what makes a wine a perfect 0 or perfect 10.
I attribute my understanding of this dataset to making a series of univariate, bivariate, and multivariate plots alongside the correlation coefficients plot. This visual information essentially guided the analysis in a way where I didn’t feel I had to guess where to go next. I would argue the data was explored at me.
To investigate this data further I would also like to know the age and average price of the wine. Age is typically a factor used to sell wine at a more expensive price, and a more expensive item can be associated with higher quality.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib